Dyr og Data

Data wrangling ‘hands on’

Gavin Simpson

Aarhus University

Mona Larsen

Aarhus University

2024-09-04

Learning objectives

At the end of this activity you should be able to

  • Understand the main dplyr verbs for data wrangling

  • Apply these verbs to data

Meet your data set

  • rows are observations
  • columns are variables
  • colour distinguishes the variables
  • the number of sides of the shape on a cube's face is the value of that observation for a specific variable

Data frame

How many rows?

How many columns?

How many observations?

This is one observation

What are the data values?

Looking at the green variable, how many sides does the observation have?

Questions?

  1. How many observations are there in your data set?

  2. What are the unique values for the data in this data set?

  3. How many variables are there in your data set?

  4. What names would you give the variables?

05:00

Operators

As well as the assignment operator <-, R has many other operators

  • Mathematical

    • +
    • -
    • *
    • /
  • Boolean (logical)

    • < and >
    • <= and >=
    • ==
    • !=
    • & AND
    • | OR
    • ! NOT
    • x %in% y: is x contained in y?
    • is.na(x) and !is.na(x)
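
The operators above can be tried on a small vector before using them on a data frame. This is a minimal sketch in base R; the vector `sides` and the result names are made up for illustration and are not part of the activity's data set.

```r
# Hypothetical vector of side counts; NA marks a missing value
sides <- c(3, 6, NA, 5, 4)

# Mathematical operators work element by element
sides_plus_one <- sides + 1

# Comparisons return TRUE, FALSE, or NA (when the value is missing)
is_triangle <- sides == 3
more_than_4 <- sides > 4

# | (OR) and & (AND) combine logical vectors element by element
either <- is_triangle | more_than_4
both <- sides >= 4 & sides != 6

# %in% asks whether each element is found in a set; it never returns NA
in_set <- sides %in% c(3, 4)

# is.na() finds missing values, and ! (NOT) flips TRUE and FALSE
n_missing <- sum(is.na(sides))
n_present <- sum(!is.na(sides))
```

Note that a comparison involving NA is itself NA, which is why is.na(x) is needed to test for missing values rather than x == NA.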

filter

Take your data frame and then filter it (only include rows where) the red column only includes observations with three sides (triangles) OR the green column only includes observations with more than 4 sides (pentagons, hexagons)

05:00

How did you do?

Load some packages

library("tibble")
library("dplyr")

Your virtual data set

data <- tribble(
  ~red, ~orange, ~yellow, ~green, ~blue, ~purple,
  3, 6, 3, 5, 4, 5,
  5, 6, 4, 4, 4, 6,
  4, 3, 6, 5, 3, 5
)

data
# A tibble: 3 × 6
    red orange yellow green  blue purple
  <dbl>  <dbl>  <dbl> <dbl> <dbl>  <dbl>
1     3      6      3     5     4      5
2     5      6      4     4     4      6
3     4      3      6     5     3      5
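
The questions asked about the physical data set (how many observations, how many variables, which unique values) can also be put to the virtual one. A minimal sketch, assuming the same `data` tibble as above; the result names (`n_obs`, `n_vars`, `var_names`, `unique_values`) are only illustrative.

```r
library("tibble")

# The same virtual data set as on the slide
data <- tribble(
  ~red, ~orange, ~yellow, ~green, ~blue, ~purple,
  3, 6, 3, 5, 4, 5,
  5, 6, 4, 4, 4, 6,
  4, 3, 6, 5, 3, 5
)

n_obs <- nrow(data)       # number of rows (observations)
n_vars <- ncol(data)      # number of columns (variables)
var_names <- names(data)  # names of the variables

# Flatten all columns into one vector, then keep each distinct value once
unique_values <- sort(unique(unlist(data)))
```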

filter

data |>         # take your data frame, and then
  filter(       # filter it (only include rows where)
    red == 3 |  # red column only includes obs with 3 sides, OR
    green > 4   # green column only includes obs with more than 4 sides
  )
# A tibble: 2 × 6
    red orange yellow green  blue purple
  <dbl>  <dbl>  <dbl> <dbl> <dbl>  <dbl>
1     3      6      3     5     4      5
2     4      3      6     5     3      5

Your turn

Take your data frame and then filter it (only include rows where) the yellow column only includes observations with four sides (squares) OR the blue column only includes observations with more than 3 sides (squares, pentagons, hexagons)

05:00

filter

data |>            # take your data frame, and then
  filter(          # filter it (only include rows where)
    yellow == 4 |  # yellow column only includes obs with 4 sides, OR
    blue > 3       # blue column only includes obs with more than 3 sides
  )
# A tibble: 2 × 6
    red orange yellow green  blue purple
  <dbl>  <dbl>  <dbl> <dbl> <dbl>  <dbl>
1     3      6      3     5     4      5
2     5      6      4     4     4      6

Your turn

Take your data frame and then filter it (only include rows where) the yellow column only includes observations with more than 3 sides AND the blue column only includes observations with more than 3 sides

05:00

filter

data |>            # take your data frame, and then
  filter(          # filter it (only include rows where)
    yellow > 3 &  # yellow column only includes obs with more than 3 sides, AND
    blue > 3      # blue column only includes obs with more than 3 sides
  )
# A tibble: 1 × 6
    red orange yellow green  blue purple
  <dbl>  <dbl>  <dbl> <dbl> <dbl>  <dbl>
1     5      6      4     4     4      6

& (AND) is the default in `filter()`

data |>            # take your data frame, and then
  filter(          # filter it (only include rows where)
    yellow > 3,   # yellow column only includes obs with more than 3 sides, AND
    blue > 3      # blue column only includes obs with more than 3 sides
  )
# A tibble: 1 × 6
    red orange yellow green  blue purple
  <dbl>  <dbl>  <dbl> <dbl> <dbl>  <dbl>
1     5      6      4     4     4      6
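
To check that separating conditions with a comma really behaves like &, the two results can be compared directly. This is a sketch under the assumption that `data` is the tibble defined earlier; the names `with_ampersand`, `with_comma`, `not_green_gt_4`, and `red_3_or_4` are illustrative only. It also shows ! (NOT) and %in% from the operators list used inside filter().

```r
library("tibble")
library("dplyr")

data <- tribble(
  ~red, ~orange, ~yellow, ~green, ~blue, ~purple,
  3, 6, 3, 5, 4, 5,
  5, 6, 4, 4, 4, 6,
  4, 3, 6, 5, 3, 5
)

# Two ways of writing "yellow > 3 AND blue > 3"
with_ampersand <- data |> filter(yellow > 3 & blue > 3)
with_comma <- data |> filter(yellow > 3, blue > 3)

# TRUE if both results hold the same rows and columns
same_result <- isTRUE(all.equal(with_ampersand, with_comma))

# NOT: keep rows where green is not more than 4
not_green_gt_4 <- data |> filter(!(green > 4))

# %in%: keep rows where red is one of 3 or 4
red_3_or_4 <- data |> filter(red %in% c(3, 4))
```

There is no comma shortcut for OR; an OR condition always needs | written out.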

select

Take your data frame and then select from it (only include) the red, yellow, and green columns

03:00

How did you do?

select

data |>                 # take your data frame, and then
  select(               # select from it (only include)
    red, yellow, green  # the red, yellow, and green columns
  )
# A tibble: 3 × 3
    red yellow green
  <dbl>  <dbl> <dbl>
1     3      3     5
2     5      4     4
3     4      6     5

Your turn

Take your data frame and then select from it all columns except the green column (exclude green)

03:00

How did you do?

select

data |>                 # take your data frame, and then
  select(               # select from it (exclude)
    -green              # the green column
  )
# A tibble: 3 × 5
    red orange yellow  blue purple
  <dbl>  <dbl>  <dbl> <dbl>  <dbl>
1     3      6      3     4      5
2     5      6      4     4      6
3     4      3      6     3      5
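
Besides listing columns by name or dropping one with -, select() understands a few other shorthand forms. The sketch below shows two of them, assuming the same `data` tibble; the result names `without_green` and `red_to_yellow` are illustrative.

```r
library("tibble")
library("dplyr")

data <- tribble(
  ~red, ~orange, ~yellow, ~green, ~blue, ~purple,
  3, 6, 3, 5, 4, 5,
  5, 6, 4, 4, 4, 6,
  4, 3, 6, 5, 3, 5
)

# ! (NOT) excludes a column, like -green
without_green <- data |> select(!green)

# : selects a run of neighbouring columns, here red through yellow
red_to_yellow <- data |> select(red:yellow)

names(without_green)
names(red_to_yellow)
```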

mutate

Take your data frame and then mutate it (create / modify columns) so that the purple column is c(4, 4, 5)

03:00

How did you do?

mutate

data |>                 # take your data frame, and then
  mutate(               # mutate it (create / modify columns)
    purple = c(4, 4, 5) # to make purple variable have values 4, 4, 5.
  )
# A tibble: 3 × 6
    red orange yellow green  blue purple
  <dbl>  <dbl>  <dbl> <dbl> <dbl>  <dbl>
1     3      6      3     5     4      4
2     5      6      4     4     4      4
3     4      3      6     5     3      5
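
mutate() can also build a column from existing columns rather than from values typed in by hand. A minimal sketch, assuming the same `data` tibble; the column name `red_plus_green` and the result name `mutated` are invented for this illustration.

```r
library("tibble")
library("dplyr")

data <- tribble(
  ~red, ~orange, ~yellow, ~green, ~blue, ~purple,
  3, 6, 3, 5, 4, 5,
  5, 6, 4, 4, 4, 6,
  4, 3, 6, 5, 3, 5
)

mutated <- data |>
  mutate(
    red_plus_green = red + green,  # new column computed row by row
    purple = 4                     # a single value is repeated for every row
  )

mutated
```

A new value must have either one element or as many elements as the data frame has rows, which is why c(4, 4, 5) works for a three-row data set.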

arrange

Take your data frame and then arrange it (sort it) in ascending order using the values of the red column

Take your data frame and then arrange it (sort it) in descending order using the values of the green column

03:00

How did you do?

arrange

data |>                 # take your data frame, and then
  arrange(red)          # arrange it in ascending order by red
# A tibble: 3 × 6
    red orange yellow green  blue purple
  <dbl>  <dbl>  <dbl> <dbl> <dbl>  <dbl>
1     3      6      3     5     4      5
2     4      3      6     5     3      5
3     5      6      4     4     4      6
data |>                 # take your data frame, and then
  arrange(desc(green))  # arrange it in descending order by green
# A tibble: 3 × 6
    red orange yellow green  blue purple
  <dbl>  <dbl>  <dbl> <dbl> <dbl>  <dbl>
1     3      6      3     5     4      5
2     4      3      6     5     3      5
3     5      6      4     4     4      6
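
Two observations share the value 5 for green, so sorting by green alone leaves them in their original order. A second column can be given to arrange() to break such ties. This is a sketch assuming the same `data` tibble; the result name `sorted` is illustrative.

```r
library("tibble")
library("dplyr")

data <- tribble(
  ~red, ~orange, ~yellow, ~green, ~blue, ~purple,
  3, 6, 3, 5, 4, 5,
  5, 6, 4, 4, 4, 6,
  4, 3, 6, 5, 3, 5
)

sorted <- data |>
  arrange(
    desc(green),  # first sort by green, largest first
    desc(red)     # within equal green values, sort by red, largest first
  )

sorted
```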

summarise

Take your data frame and then summarise it (create a table of summaries) with the maximum value of purple

Take your data frame and then summarise it (create a table of summaries) with the:

  • minimum value of red
  • minimum value of green
  • maximum value of orange

05:00

How did you do?

summarise

data |>                   # take your data frame, and then
  summarise(max(purple))  # create a summary of it to show the maximum of purple
# A tibble: 1 × 1
  `max(purple)`
          <dbl>
1             6
data |>                   # take your data frame, and then
  summarise(              # create a summary of it to show
    min(red),             # the minimum of red
    min(green),           # the minimum of green
    max(orange)           # the maximum of orange
  )
# A tibble: 1 × 3
  `min(red)` `min(green)` `max(orange)`
       <dbl>        <dbl>         <dbl>
1          3            4             6
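
The column names in the output above, such as `min(red)`, are awkward to type later. summarise() accepts name = value pairs, so each summary can be given a plain name. A minimal sketch, assuming the same `data` tibble; the names `min_red`, `min_green`, `max_orange`, and `summary_table` are illustrative choices.

```r
library("tibble")
library("dplyr")

data <- tribble(
  ~red, ~orange, ~yellow, ~green, ~blue, ~purple,
  3, 6, 3, 5, 4, 5,
  5, 6, 4, 4, 4, 6,
  4, 3, 6, 5, 3, 5
)

summary_table <- data |>
  summarise(
    min_red = min(red),       # the minimum of red
    min_green = min(green),   # the minimum of green
    max_orange = max(orange)  # the maximum of orange
  )

summary_table
```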